Evaluation of feature extraction methods for query-by-example spoken term detection with low resource languages

In this project we examine different feature extraction methods (Kaldi MFCCs, BUT/Phonexia Bottleneck features, and variants of wav2vec 2.0) for performing QbE-STD with data from language documentation projects.

A walkthrough of the entire experiment pipeline can be found in scripts/README.md. Links to archived experiment artefacts uploaded to Zenodo are provided in the last section of this README file. The analyses based on the data are described in analyses/README.md, which also provides links to the pilot analyses with a multilingual model, system evaluations, and the error analysis (all viewable online as GitHub Markdown).

Citation

@misc{san2021leveraging,
      title={Leveraging pre-trained representations to improve access to untranscribed speech from endangered languages}, 
      author={San, Nay and Bartelds, Martijn and Browne, Mitchell and Clifford, Lily and Gibson, Fiona and Mansfield, John and Nash, David and Simpson, Jane and Turpin, Myfany and Vollmer, Maria and Wilmoth, Sasha and Jurafsky, Dan},
      year={2021},
      eprint={2103.14583},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Directory structure

The directory structure for this project roughly follows the Cookiecutter Data Science guidelines.

├── README.md                    <- This top-level README
├── docker-compose.yml           <- Configurations for launching Docker containers
├── qbe-std_feats_eval.Rproj     <- RStudio project file, used to get repository path using R's 'here' package
├── requirements.txt             <- Python package requirements
├── tmp/                         <- Empty directory to download zip files into, if required
├── data/
│   ├── raw/                     <- Immutable data, not modified by scripts
│   │   ├── datasets/            <- Audio data and ground truth labels placed here
│   │   ├── model_checkpoints/   <- wav2vec 2.0 model checkpoint files placed here
│   ├── interim/                         
│   │   ├── features/            <- features generated by extraction scripts (automatically generated)
│   ├── processed/      
│   │   ├── dtw/                 <- results returned by DTW search (automatically generated)
│   │   ├── STDEval/             <- evaluation of DTW searches (automatically generated)
├── scripts/
│   ├── README.md                <- walkthrough for entire experiment pipeline
│   ├── wav_to_shennong-feats.py <- Extraction script for MFCC and BNF features using the Shennong library
│   ├── wav_to_w2v2-feats.py     <- Extraction script for wav2vec 2.0 features
│   ├── feats_to_dtw.py          <- QbE-STD DTW search using extracted features
│   ├── prep_STDEval.R           <- Helper script to generate files needed for STD evaluation
│   ├── gather_mtwv.R            <- Script to gather Maximum Term Weighted Values generated by STDEval
│   ├── STDEval-0.7/             <- NIST STDEval tool
├── analyses/
│   ├── data/                    <- Final, post-processed data used in analyses
│   ├── mtwv.md                  <- MTWV figures and statistics reported in paper
│   ├── error-analysis.md        <- Error analyses reported in paper
├── paper/
│   ├── ASRU2021.tex             <- LaTeX source file of ASRU paper
│   ├── ASRU2021.pdf             <- Final paper submitted to ASRU2021
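
For readers unfamiliar with the DTW search step listed in the tree above (feats_to_dtw.py), the following is a minimal sketch of subsequence dynamic time warping over frame-level feature vectors. It is an illustration only, not the script's actual implementation: the function names, the use of cosine distance, and the normalisation of the cost by query length are all assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero frame as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def subsequence_dtw(query, reference):
    """Find the best match of `query` anywhere inside `reference`.

    Both arguments are lists of frames (each frame a list of floats).
    Returns (cost, start, end): the accumulated distance divided by the
    number of query frames, and the matched reference frame span.
    """
    n, m = len(query), len(reference)
    # acc[i][j]: lowest accumulated cost of aligning query[0..i]
    # with some stretch of the reference that ends at frame j.
    acc = [[math.inf] * m for _ in range(n)]
    # start[i][j]: reference frame where that best alignment begins.
    start = [[0] * m for _ in range(n)]

    # The first query frame may align with any reference frame at no
    # extra cost; this is what makes the search a *subsequence* search.
    for j in range(m):
        acc[0][j] = cosine_distance(query[0], reference[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            # Allowed predecessors: vertical, horizontal, diagonal.
            candidates = [(acc[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((acc[i][j - 1], start[i][j - 1]))
                candidates.append((acc[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            acc[i][j] = best_cost + cosine_distance(query[i], reference[j])
            start[i][j] = best_start

    # The match may end at any reference frame; pick the cheapest.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end] / n, start[n - 1][end], end


# Toy example: a two-frame query embedded in a five-frame reference.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_reference = [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_cost, toy_start, toy_end = subsequence_dtw(toy_query, toy_reference)
```

In a QbE-STD setting, a lower cost for a query and reference pair would indicate a more likely occurrence of the query term, and thresholding such scores yields the detections that a tool like STDEval can then evaluate.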

Experiment data and artefacts

All data used in and generated by the experiments are available on Zenodo (see https://zenodo.org/communities/qbe-std_feats_eval), with two exceptions: raw audio and texts from the Australian language documentation projects, which we do not have permission to release openly, and those from the Mavir corpus, which can be obtained from the original distributor subject to signing their licence agreement. The available data are:

Dataset: Gronings https://zenodo.org/record/4634878
Experiment artefacts:
- MFCC, BNF and wav2vec 2.0 LibriSpeech 960h features (limited to 50 GB per archive by Zenodo):
  - Archive I (eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets): https://zenodo.org/record/4635438
  - Archive II (gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd datasets): https://zenodo.org/record/4635493
  - Archive III (w2v2-T11 only; all datasets): https://zenodo.org/record/4638385
- wav2vec 2.0 XLSR53 features:
  - Archive I (eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets): https://zenodo.org/record/5504371
  - Archive II (gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd datasets): https://zenodo.org/record/5504471
- DTW search and evaluation data: https://zenodo.org/record/5508217
